
[Kernels] Isolate modular kernel code from FusedMoEMethodBase subclasses.#27123

Merged

mgoin merged 16 commits into vllm-project:main from neuralmagic:swap-quant-method on Nov 4, 2025
Conversation

bnellnm (Collaborator) commented Oct 17, 2025

Purpose

Make a new FusedMoEModularMethod subclass of FusedMoEMethodBase for use with modular kernels.

Instead of having every subclass of FusedMoEMethodBase check self.fused_experts, we swap out the quant_method of the FusedMoE layer for an instance of FusedMoEModularMethod. This reduces the complexity of the various FusedMoEMethodBase subclasses' apply methods and isolates all uses of modular kernels to the new class.
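In code, the swap amounts to roughly the following minimal sketch (the constructor and apply() signatures here are simplified stand-ins, not the exact vLLM API):

```python
# Minimal sketch of the quant-method swap; signatures are simplified guesses.
class FusedMoEMethodBase:
    def apply(self, layer, x, router_logits):
        # Each quantization backend (fp8, AWQ, ...) overrides apply() and no
        # longer needs to branch on whether a modular kernel exists.
        raise NotImplementedError


class FusedMoEModularMethod(FusedMoEMethodBase):
    """Wraps the original quant method and routes apply() to a modular kernel."""

    def __init__(self, old_quant_method, fused_experts):
        self.old_quant_method = old_quant_method  # keeps weight/config access
        self.fused_experts = fused_experts        # the FusedMoEModularKernel

    def apply(self, layer, x, router_logits):
        # All modular-kernel dispatch now lives in this one class.
        return self.fused_experts(x, layer.w13_weight, layer.w2_weight, router_logits)
```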

Test Plan

- Ran by hand on some fp8 + modelopt models.
- CI tests

Test Result

cc @varun-sundar-rabindranath , @wenscarl

mergify bot (Contributor) commented Oct 17, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Oct 17, 2025
chatgpt-codex-connector bot left a comment

💡 Codex Review

https://github.com/vllm-project/vllm/blob/474381baec872bc9f45e221754d420f41b93ace0/vllm/model_executor/layers/fused_moe/layer.py#L2115-L2118
P0: Accessing missing fused_experts attribute

The commit removes fused_experts from FusedMoEMethodBase, but FusedMoE still unconditionally accesses self.quant_method.fused_experts here (and again later when staging tokens). When the quant method does not use a modular kernel—e.g. AWQ, BitsAndBytes, RTN—init_prepare_finalize now leaves the original quant method in place and it no longer defines a fused_experts attribute. These checks will therefore raise AttributeError before any routing happens. The guard should use hasattr or using_modular_kernel instead of dereferencing the attribute directly.
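A defensive version of that guard could be as simple as this sketch (the helper name is hypothetical):

```python
def _uses_modular_kernel(quant_method) -> bool:
    # Hypothetical helper: non-modular quant methods (AWQ, BitsAndBytes, RTN)
    # don't define fused_experts, so probe with getattr instead of
    # dereferencing the attribute directly.
    return getattr(quant_method, "fused_experts", None) is not None
```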


Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
gemini-code-assist bot left a comment

Code Review

This pull request refactors the handling of modular kernels for Fused MoE layers by introducing a FusedMoEModularMethod wrapper. This is a good simplification that centralizes logic. However, I've identified two critical issues that could lead to runtime errors. One is related to an incorrect condition for EPLB support in the FP8 quantization method, and the other is an incorrect API usage for submodule replacement. I have provided detailed comments and suggestions to address these issues.

Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
Comment thread vllm/model_executor/layers/quantization/fp8.py
mergify bot removed the needs-rebase label Oct 18, 2025
bnellnm changed the title from "[Kernels] Swap quant method" to "[Kernels] Isolate modular kernel code from FusedMoEMethodBase subclasses." on Oct 20, 2025
mergify bot (Contributor) commented Oct 23, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Comment thread vllm/model_executor/layers/fused_moe/layer.py
Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/layer.py
Comment thread vllm/model_executor/layers/fused_moe/modular_kernel.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/layer.py Outdated
varun-sundar-rabindranath (Contributor) commented:

Thanks @bnellnm. This cleans up a bunch of redundant code 🙌.

I have a suggestion. IIUC, the function call chain for the construction of FusedMoEModularMethod looks something like the following:

1. `DeviceCommunicatorBase::prepare_communication_buffer_for_model()` calls  `FusedMoE::init_prepare_finalize()` 
2. `FusedMoE::init_prepare_finalize()` calls `FusedMoEMethodBase::init_prepare_finalize()` and it returns the `FusedMoEModularKernel` object
3. `FusedMoE::init_prepare_finalize()` then makes a `FusedMoEModularMethod` object and overrides its `self.quant_method`

Here, note that FusedMoEMethodBase::init_prepare_finalize() calls FusedMoEMethodBase::maybe_make_prepare_finalize(), which in turn calls a static function FusedMoEMethodBase::_maybe_make_prepare_finalize() that does most of the work anyway.

My suggestion is to move FusedMoEModularMethod into its own file and expose a function, say maybe_make_fused_moe_modular_method(), that attempts to construct the FusedMoEModularMethod object.

That way, we can get rid of most of the ModularKernel-specific code from fused_moe/layer.py and have it in a different file, cleaning up fused_moe/layer.py greatly.

What do you think?

I am not suggesting we do it in this PR. I can take it up as well 👍
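For reference, the proposed helper might look roughly like the sketch below (the function name comes from the comment above; the module, body, and signatures are guesses):

```python
# fused_moe/modular_method.py (hypothetical new home for FusedMoEModularMethod).
# Assumes FusedMoEModularKernel and FusedMoEModularMethod are importable here.

def maybe_make_fused_moe_modular_method(quant_method, moe_config):
    """Try to build a FusedMoEModularMethod; return the original quant
    method unchanged when no prepare/finalize object can be constructed."""
    prepare_finalize = quant_method.maybe_make_prepare_finalize(moe_config)
    if prepare_finalize is None:
        return quant_method
    experts = quant_method.select_gemm_impl(prepare_finalize, moe_config)
    kernel = FusedMoEModularKernel(prepare_finalize, experts)
    return FusedMoEModularMethod(quant_method, kernel)
```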

bnellnm (Collaborator, Author) commented Oct 24, 2025

> My suggestion is to move FusedMoEModularMethod into its own file and expose a function, say maybe_make_fused_moe_modular_method(), that attempts to construct the FusedMoEModularMethod object. That way, we can get rid of most of the ModularKernel-specific code from fused_moe/layer.py. […]

Yeah, that's a good idea. I was also considering splitting up layer.py in different ways, e.g. moving UnquantizedMoEMethod to a separate file.

I'd rather do that in a separate PR though.

Comment thread vllm/model_executor/layers/fused_moe/layer.py
Comment thread vllm/model_executor/layers/quantization/bitsandbytes.py
Review diff context:
if layer.w2_weight is None
else layer.w2_weight
)
assert all([w is not None for w in [layer.w13_weight, layer.w2_weight]])

I think this setting of layer.w13_weight and layer.w2_weight fits better in the process_weights_after_loading function here.

That way we can get rid of having to differentiate between w13_weight_triton_tensor/w2_weight_triton_tensor and w13_weight/w2_weight.

Not suggesting this for this PR. Fixing that, I think, should be its own PR.

varun-sundar-rabindranath (Contributor) left a comment

LGTM! Very nice cleanups! Thanks @bnellnm

tlrmchlsmth added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Oct 30, 2025
16 commits, each: Signed-off-by: Bill Nell <bnell@redhat.com>
mgoin merged commit 938772a into vllm-project:main on Nov 4, 2025
61 checks passed
wangshangsam (Collaborator) commented Nov 6, 2025

@bnellnm I have a ... maybe dumb ... question - how exactly is each derived MoEMethod class going to trigger FusedMoEModularMethod.apply() (thereby using the modular kernels)? Doesn't each subclass override the .apply() completely?

bnellnm (Collaborator, Author) commented Nov 6, 2025

> @bnellnm I have a ... maybe dumb ... question - how exactly is each derived MoEMethod class going to trigger FusedMoEModularMethod.apply() (thereby using the modular kernels)? Doesn't each subclass override the .apply() completely?

The FusedMoE layer calls self.quant_method.apply, so if no modular kernel has been constructed, this invokes the apply method of some subclass of FusedMoEMethodBase. When a modular kernel gets created, the FusedMoE layer swaps out self.quant_method for an instance of FusedMoEModularMethod, which calls the modular kernel instead.

So, subclasses of FusedMoEMethodBase no longer need to worry about modifying apply for modular kernels.
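A toy, runnable illustration of that dispatch (class names are generic stand-ins, not the actual vLLM classes):

```python
class BackendMethod:
    # Stand-in for a FusedMoEMethodBase subclass.
    def apply(self, x):
        return f"backend apply({x})"


class ModularMethod(BackendMethod):
    # Stand-in for FusedMoEModularMethod wrapping the old method.
    def __init__(self, old_method):
        self.old_method = old_method

    def apply(self, x):
        return f"modular-kernel apply({x})"


class Layer:
    # Stand-in for FusedMoE: always dispatches through self.quant_method.
    def __init__(self):
        self.quant_method = BackendMethod()

    def forward(self, x):
        return self.quant_method.apply(x)


layer = Layer()
print(layer.forward("t0"))  # -> backend apply(t0)

# After a modular kernel is constructed, the layer swaps the instance:
layer.quant_method = ModularMethod(layer.quant_method)
print(layer.forward("t1"))  # -> modular-kernel apply(t1)
```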

ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
bnellnm deleted the swap-quant-method branch November 11, 2025 19:43
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025